Text analysis: fundamentals and sentiment analysis

MACS 30500
University of Chicago

February 27, 2017

Basic workflow for text analysis

  • Obtain your text sources
  • Extract documents and move into a corpus
  • Transformation
  • Extract features
  • Perform analysis

Obtain your text sources

  • Web sites
    • Twitter
  • Databases
  • PDF documents
  • Digital scans of printed materials

Extract documents and move into a corpus

  • Text corpus
  • Typically stores each document's text as a raw character string, with metadata stored alongside it

Transformation

  • Tag tokens with their part of speech (noun, verb, adjective, etc.) or recognize named entities (person, place, company, etc.)
  • Standard text processing
    • Convert to lower case
    • Remove punctuation
    • Remove numbers
    • Remove stopwords
    • Remove domain-specific stopwords
    • Stemming
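
The standard processing steps listed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code behind these slides: the `preprocess` function, its `extra_stopwords` parameter, and the tiny `STOPWORDS` set are all hypothetical names introduced here. Stemming is left out because it needs a real stemmer (for example, a Porter stemmer from an external library).

```python
import re
import string

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "at", "be", "in", "is", "my", "of", "the", "to", "will"}


def preprocess(text, extra_stopwords=frozenset()):
    """Lower-case, strip punctuation and numbers, then drop stopwords."""
    text = text.lower()                                                # convert to lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                    # remove numbers
    drop = STOPWORDS | set(extra_stopwords)                            # general + domain-specific stopwords
    return [token for token in text.split() if token not in drop]


tokens = preprocess(
    "My economic policy speech will be carried live at 12:15 P.M. Enjoy!",
    extra_stopwords={"speech"},  # hypothetical domain-specific stopword
)
print(tokens)
```

The order of the steps matters: here punctuation is removed before numbers, so "12:15" collapses to "1215" and is then dropped entirely.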

Extract features

  • Convert the text string into quantifiable measures
  • Bag-of-words model
    • Term frequency vector
    • Term-document matrix
    • Ignores context
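
A small sketch of the bag-of-words model, assuming whitespace-separated, already-cleaned text. The two example documents and the variable names are made up for illustration. It builds one term frequency vector per document, stacks them into a term-document matrix, and shows what "ignores context" means in practice.

```python
from collections import Counter

# Hypothetical two-document corpus.
docs = {
    "doc1": "the media is going crazy",
    "doc2": "the job she has done is a joke",
}

# Term frequency vector: one word-count mapping per document.
term_freq = {name: Counter(text.split()) for name, text in docs.items()}

# Term-document matrix: one row per term, one column per document.
vocab = sorted(set().union(*term_freq.values()))
tdm = [[term_freq[name][term] for name in docs] for term in vocab]

for term, row in zip(vocab, tdm):
    print(f"{term:>6} {row}")

# Word order is discarded, so these two sentences get identical vectors.
same_bag = Counter("dog bites man".split()) == Counter("man bites dog".split())
print(same_bag)
```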

Perform analysis

  • Basic
    • Word frequency
    • Collocation
    • Dictionary tagging
  • Advanced
    • Document classification
      • Supervised
      • Unsupervised
    • Corpora comparison
    • Topic modeling

Sentiment analysis

I am happy

Dictionaries

## # A tibble: 6,788 × 2
##           word sentiment
##          <chr>     <chr>
## 1      2-faced  negative
## 2      2-faces  negative
## 3           a+  positive
## 4     abnormal  negative
## 5      abolish  negative
## 6   abominable  negative
## 7   abominably  negative
## 8    abominate  negative
## 9  abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows

Dictionaries

## # A tibble: 2,476 × 2
##          word score
##         <chr> <int>
## 1     abandon    -2
## 2   abandoned    -2
## 3    abandons    -2
## 4    abducted    -2
## 5   abduction    -2
## 6  abductions    -2
## 7       abhor    -3
## 8    abhorred    -3
## 9   abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows

Dictionaries

## # A tibble: 13,901 × 2
##           word sentiment
##          <chr>     <chr>
## 1       abacus     trust
## 2      abandon      fear
## 3      abandon  negative
## 4      abandon   sadness
## 5    abandoned     anger
## 6    abandoned      fear
## 7    abandoned  negative
## 8    abandoned   sadness
## 9  abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows

Dictionaries

## # A tibble: 10 × 2
##       sentiment     n
##           <chr> <int>
## 1         anger  1247
## 2  anticipation   839
## 3       disgust  1058
## 4          fear  1476
## 5           joy   689
## 6      negative  3324
## 7      positive  2312
## 8       sadness  1191
## 9      surprise   534
## 10        trust  1231

Measuring overall sentiment
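
One simple way to measure overall sentiment with a scored dictionary is to sum the scores of the words that appear in the text. The sketch below assumes a toy lexicon: the scores for "abandon" and "abhor" are the ones shown in the scored dictionary earlier in these slides, while the entry for "happy" is an assumed value added for illustration. The function name `sentiment_score` is hypothetical.

```python
import re

# Toy scored lexicon. "abandon" and "abhor" use scores shown in the slides;
# "happy": 3 is an assumption made for this example.
LEXICON = {"abandon": -2, "abhor": -3, "happy": 3}


def sentiment_score(text, lexicon=LEXICON):
    """Sum the dictionary scores of every word in the text (unknown words score 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


print(sentiment_score("I am happy"))
print(sentiment_score("I abhor delays"))
print(sentiment_score("I am not happy"))  # negation is ignored by this approach
```

The last call illustrates a known limitation of dictionary tagging: because each word is scored on its own, "not happy" receives the same score as "happy".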

tidytext

  • Tidy text format
  • Defined as one-term-per-row
  • Differs from the document-term matrix
    • One-document-per-row and one-term-per-column
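
The difference between the two layouts can be sketched with plain Python data structures. The two short documents and their ids are hypothetical. The tidy form holds one (document, term) pair per row; the document-term matrix holds one row per document and one column per term.

```python
from collections import Counter

# Hypothetical documents keyed by id.
docs = {"t1": "record of health", "t2": "another record record"}

# Tidy text format: one term per row, tagged with its document.
tidy = [(doc_id, term) for doc_id, text in docs.items() for term in text.split()]

# Document-term matrix: one document per row, one term per column.
counts = Counter(tidy)
vocab = sorted({term for _, term in tidy})
dtm = {doc_id: [counts[(doc_id, term)] for term in vocab] for doc_id in docs}

print(tidy)
print(vocab)
print(dtm)
```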

@realDonaldTrump

Obtaining documents

## Classes 'tbl_df', 'tbl' and 'data.frame':    1512 obs. of  16 variables:
##  $ text         : chr  "My economic policy speech will be carried live at 12:15 P.M. Enjoy!" "Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8" "#ICYMI: \"Will Media Apologize to Trump?\" https://t.co/ia7rKBmioA" "Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton "| __truncated__ ...
##  $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ favoriteCount: num  9214 6981 15724 19837 34051 ...
##  $ replyToSN    : chr  NA NA NA NA ...
##  $ created      : POSIXct, format: "2016-08-08 15:20:44" "2016-08-08 13:28:20" ...
##  $ truncated    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ replyToSID   : logi  NA NA NA NA NA NA ...
##  $ id           : chr  "762669882571980801" "762641595439190016" "762439658911338496" "762425371874557952" ...
##  $ replyToUID   : chr  NA NA NA NA ...
##  $ statusSource : chr  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" ...
##  $ screenName   : chr  "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" "realDonaldTrump" ...
##  $ retweetCount : num  3107 2390 6691 6402 11717 ...
##  $ isRetweet    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ retweeted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longitude    : chr  NA NA NA NA ...
##  $ latitude     : chr  NA NA NA NA ...

Clean up the data

id | source | text | created
762669882571980801 | Android | My economic policy speech will be carried live at 12:15 P.M. Enjoy! | 2016-08-08 15:20:44
762641595439190016 | iPhone | Join me in Fayetteville, North Carolina tomorrow evening at 6pm. Tickets now available at: https://t.co/Z80d4MYIg8 | 2016-08-08 13:28:20
762439658911338496 | iPhone | #ICYMI: “Will Media Apologize to Trump?” https://t.co/ia7rKBmioA | 2016-08-08 00:05:54
762425371874557952 | Android | Michael Morell, the lightweight former Acting Director of C.I.A., and a man who has made serious bad calls, is a total Clinton flunky! | 2016-08-07 23:09:08
762400869858115588 | Android | The media is going crazy. They totally distort so many things on purpose. Crimea, nuclear, “the baby” and so much more. Very dishonest! | 2016-08-07 21:31:46
762284533341417472 | Android | I see where Mayor Stephanie Rawlings-Blake of Baltimore is pushing Crooked hard. Look at the job she has done in Baltimore. She is a joke! | 2016-08-07 13:49:29
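
The cleaned table keeps four columns and reduces the HTML in `statusSource` to a short device name. A minimal sketch of that step, assuming each tweet is a dictionary with the field names shown in the raw data; the helper names `device_from_source` and `clean_row` are hypothetical.

```python
import re


def device_from_source(status_source):
    """Pull the device name out of the anchor text, e.g. 'Twitter for Android'."""
    match = re.search(r"Twitter for ([^<]+)<", status_source)
    return match.group(1) if match else None


def clean_row(tweet):
    """Keep only id, source, text and created, with source reduced to the device."""
    return {
        "id": tweet["id"],
        "source": device_from_source(tweet["statusSource"]),
        "text": tweet["text"],
        "created": tweet["created"],
    }


raw = {
    "id": "762669882571980801",
    "statusSource": '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    "text": "My economic policy speech will be carried live at 12:15 P.M. Enjoy!",
    "created": "2016-08-08 15:20:44",
    "retweetCount": 3107,
}
cleaned = clean_row(raw)
print(cleaned["source"])
```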

When does he tweet?

How does he retweet?

Trump tweets

Remove manual retweets

Tokenize

id source created word
676494179216805888 iPhone 2015-12-14 20:09:15 record
676494179216805888 iPhone 2015-12-14 20:09:15 of
676494179216805888 iPhone 2015-12-14 20:09:15 health
676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
676494179216805888 iPhone 2015-12-14 20:09:15 #trump2016
676509769562251264 iPhone 2015-12-14 21:11:12 another
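
Tweets need a tokenizer that keeps hashtags and mentions intact, as in the rows above. The following is a rough sketch of one way to do that with regular expressions; the patterns and the `tokenize` function are assumptions for illustration, not the tokenizer used to produce the table.

```python
import re

URL = re.compile(r"https?://\S+")
# A token is an optional leading # or @, then word characters,
# optionally followed by an apostrophe suffix such as 't or 's.
TOKEN = re.compile(r"[#@]?[a-z0-9_]+(?:'[a-z]+)?")


def tokenize(text):
    """Lower-case the tweet, drop URLs, and return its tokens in order."""
    return TOKEN.findall(URL.sub("", text.lower()))


tweet = {
    "id": "676494179216805888",
    "source": "iPhone",
    "text": "Record of health #MakeAmericaGreatAgain #Trump2016",  # shortened example text
}

# One row per token, carrying the tweet's id and source along.
rows = [(tweet["id"], tweet["source"], word) for word in tokenize(tweet["text"])]
for row in rows:
    print(row)
```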

Remove stop words

id source created word
676494179216805888 iPhone 2015-12-14 20:09:15 record
676494179216805888 iPhone 2015-12-14 20:09:15 health
676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
676494179216805888 iPhone 2015-12-14 20:09:15 #trump2016
676509769562251264 iPhone 2015-12-14 21:11:12 accolade
676509769562251264 iPhone 2015-12-14 21:11:12 @trumpgolf

Frequency of tokens

Assessing word importance

  • Term frequency (tf)
  • Inverse document frequency (idf)
  • Term frequency-inverse document frequency (tf-idf)

    \[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]
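
As a worked example of these definitions, the sketch below computes tf, idf, and their product for two terms, treating each source (Android, iPhone) as one document. The counts are taken from the tables in these slides; the function names are hypothetical, and tf is assumed to be a term's count divided by the total word count of its document.

```python
import math


def term_frequency(n, total):
    """Share of a document's words accounted for by the term."""
    return n / total


def inverse_document_frequency(n_documents, n_containing):
    """Natural log of (number of documents / number of documents containing the term)."""
    return math.log(n_documents / n_containing)


def tf_idf(n, total, n_documents, n_containing):
    return term_frequency(n, total) * inverse_document_frequency(n_documents, n_containing)


# #makeamericagreatagain: 95 of 3,852 iPhone words, found in 1 of the 2 sources.
hashtag_score = tf_idf(95, 3852, n_documents=2, n_containing=1)

# #trump2016: 171 of 3,852 iPhone words, but found in both sources.
common_score = tf_idf(171, 3852, n_documents=2, n_containing=2)

print(round(hashtag_score, 9), common_score)
```

A term that appears in every document has idf = ln(1) = 0, so its tf-idf is 0 no matter how frequent it is. With only two documents, idf can take just two values, 0 or ln(2).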

Calculate tf-idf

## # A tibble: 3,235 × 3
##     source                   word     n
##      <chr>                  <chr> <int>
## 1   iPhone             #trump2016   171
## 2  Android                hillary   124
## 3   iPhone #makeamericagreatagain    95
## 4  Android                crooked    93
## 5  Android                clinton    66
## 6  Android                 people    64
## 7   iPhone                hillary    52
## 8  Android                   cruz    50
## 9  Android                    bad    43
## 10  iPhone                america    43
## # ... with 3,225 more rows

Calculate tf-idf

## # A tibble: 2 × 2
##    source total
##     <chr> <int>
## 1 Android  4901
## 2  iPhone  3852

Calculate tf-idf

## # A tibble: 3,235 × 4
##     source                   word     n total
##      <chr>                  <chr> <int> <int>
## 1   iPhone             #trump2016   171  3852
## 2  Android                hillary   124  4901
## 3   iPhone #makeamericagreatagain    95  3852
## 4  Android                crooked    93  4901
## 5  Android                clinton    66  4901
## 6  Android                 people    64  4901
## 7   iPhone                hillary    52  3852
## 8  Android                   cruz    50  4901
## 9  Android                    bad    43  4901
## 10  iPhone                america    43  3852
## # ... with 3,225 more rows

Calculate tf-idf

## # A tibble: 3,235 × 7
##     source                   word     n total         tf       idf
##      <chr>                  <chr> <int> <int>      <dbl>     <dbl>
## 1   iPhone             #trump2016   171  3852 0.04439252 0.0000000
## 2  Android                hillary   124  4901 0.02530096 0.0000000
## 3   iPhone #makeamericagreatagain    95  3852 0.02466251 0.6931472
## 4  Android                crooked    93  4901 0.01897572 0.0000000
## 5  Android                clinton    66  4901 0.01346664 0.0000000
## 6  Android                 people    64  4901 0.01305856 0.0000000
## 7   iPhone                hillary    52  3852 0.01349948 0.0000000
## 8  Android                   cruz    50  4901 0.01020200 0.0000000
## 9  Android                    bad    43  4901 0.00877372 0.0000000
## 10  iPhone                america    43  3852 0.01116303 0.0000000
## # ... with 3,225 more rows, and 1 more variables: tf_idf <dbl>

Which terms have a high tf-idf?

## # A tibble: 3,235 × 6
##     source                   word     n          tf       idf      tf_idf
##      <chr>                  <chr> <int>       <dbl>     <dbl>       <dbl>
## 1   iPhone #makeamericagreatagain    95 0.024662513 0.6931472 0.017094751
## 2   iPhone                   join    42 0.010903427 0.6931472 0.007557680
## 3   iPhone          #americafirst    27 0.007009346 0.6931472 0.004858508
## 4   iPhone             #votetrump    23 0.005970924 0.6931472 0.004138729
## 5   iPhone             #imwithyou    20 0.005192108 0.6931472 0.003598895
## 6   iPhone        #crookedhillary    17 0.004413292 0.6931472 0.003059061
## 7   iPhone          #trumppence16    14 0.003634476 0.6931472 0.002519227
## 8   iPhone                    7pm    11 0.002855659 0.6931472 0.001979392
## 9   iPhone                  video    11 0.002855659 0.6931472 0.001979392
## 10 Android                  badly    13 0.002652520 0.6931472 0.001838587
## # ... with 3,225 more rows

Sentiment analysis

## # A tibble: 1,172 × 3
##                    id  source total_words
##                 <chr>   <chr>       <int>
## 1  676494179216805888  iPhone        3852
## 2  676509769562251264  iPhone        3852
## 3  680496083072593920 Android        4901
## 4  680503951440121856 Android        4901
## 5  680505672476262400 Android        4901
## 6  680734915718176768 Android        4901
## 7  682764544402440192  iPhone        3852
## 8  682792967736848385  iPhone        3852
## 9  682805320217980929  iPhone        3852
## 10 685490467329425408 Android        4901
## # ... with 1,162 more rows

Sentiment analysis

## # A tibble: 6 × 4
##    source    sentiment total_words words
##     <chr>        <chr>       <int> <dbl>
## 1 Android        anger        4901   321
## 2 Android anticipation        4901   256
## 3 Android      disgust        4901   207
## 4 Android         fear        4901   268
## 5 Android          joy        4901   199
## 6 Android     negative        4901   560
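
Counts like these come from matching each token against the sentiment dictionary and tallying by source and sentiment. A minimal sketch of that join, using the dictionary rows for "abandon" and "abandoned" shown earlier in these slides; the token list is hypothetical and the variable names are made up for illustration.

```python
from collections import Counter, defaultdict

# Dictionary rows (word, sentiment) as shown in the slides; a word can carry several sentiments.
LEXICON_ROWS = [
    ("abandon", "fear"), ("abandon", "negative"), ("abandon", "sadness"),
    ("abandoned", "anger"), ("abandoned", "fear"),
    ("abandoned", "negative"), ("abandoned", "sadness"),
]

sentiments_by_word = defaultdict(list)
for word, sentiment in LEXICON_ROWS:
    sentiments_by_word[word].append(sentiment)

# Hypothetical tokens as (source, word) pairs.
tokens = [
    ("Android", "abandoned"), ("Android", "people"),
    ("Android", "abandon"), ("iPhone", "abandon"),
]

# Inner join: words missing from the dictionary contribute nothing.
counts = Counter(
    (source, sentiment)
    for source, word in tokens
    for sentiment in sentiments_by_word.get(word, [])
)
for key in sorted(counts):
    print(key, counts[key])
```

Because one word can match several sentiments, a single token can be counted more than once across categories.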

Is this significant?

## # A tibble: 10 × 9
##       sentiment estimate statistic      p.value parameter  conf.low
##           <chr>    <dbl>     <dbl>        <dbl>     <dbl>     <dbl>
## 1         anger 1.492863       321 2.193242e-05  274.3619 1.2353162
## 2  anticipation 1.169804       256 1.191668e-01  239.6467 0.9604950
## 3       disgust 1.677259       207 1.777434e-05  170.2164 1.3116238
## 4          fear 1.560280       268 1.886129e-05  225.6487 1.2640494
## 5           joy 1.002605       199 1.000000e+00  198.7724 0.8089357
## 6      negative 1.692841       560 7.094486e-13  459.1363 1.4586926
## 7      positive 1.058760       555 3.820571e-01  541.4449 0.9303732
## 8       sadness 1.620044       303 1.150493e-06  251.9650 1.3260252
## 9      surprise 1.167925       159 2.174483e-01  148.9393 0.9083517
## 10        trust 1.128482       369 1.471929e-01  350.5114 0.9597478
## # ... with 3 more variables: conf.high <dbl>, method <fctr>,
## #   alternative <fctr>
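
The slides do not name the test behind this table. The columns resemble the output of an exact comparison of two Poisson rates, so the sketch below shows that approach under that assumption. Conditional on the combined count, testing equal rates reduces to a two-sided exact binomial test. The Android count (321) and the word totals (4,901 and 3,852) come from the slides; the iPhone count of 169 is not shown anywhere in the slides and is an assumed value, chosen because it is consistent with the anger row's estimate. Function names are hypothetical.

```python
import math


def rate_ratio(x1, t1, x2, t2):
    """Ratio of two event rates: (x1 per t1 words) / (x2 per t2 words)."""
    return (x1 / t1) / (x2 / t2)


def exact_rate_test(x1, t1, x2, t2):
    """Two-sided exact p-value for equal rates, conditioning on the total count x1 + x2."""
    n = x1 + x2
    p = t1 / (t1 + t2)  # under equal rates, each event falls in group 1 with this probability

    def pmf(k):
        return math.comb(n, k) * p**k * (1 - p) ** (n - k)

    observed = pmf(x1)
    # Sum the probability of every outcome at most as likely as the observed one.
    total = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-7))
    return min(1.0, total)


# Anger words: 321 of 4,901 Android words vs. an ASSUMED 169 of 3,852 iPhone words.
ratio = rate_ratio(321, 4901, 169, 3852)
expected_android = (321 + 169) * 4901 / (4901 + 3852)
p_value = exact_rate_test(321, 4901, 169, 3852)
print(round(ratio, 4), round(expected_android, 2), p_value)
```

A ratio above 1 means the sentiment is relatively more common in Android tweets; the p-value indicates whether a gap that large would be surprising if both sources used such words at the same rate.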

Most important words

Wordclouds

#rstats

Pope Francis

Pope Francis vs. President Trump

Acknowledgments